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Abstract: Morgan, Hodge, Trepinski, and Anderson (2014) have written an article that continues to 
confirm what we have known for some time—teacher effects on student achievement have limited 
stability. In this commentary, we address the other potential contributions this work can make to 
inform practice, policy, and research. While illustrating Morgan et al.’s inattention to history, we take 
the opportunity to reframe their findings. Considering the authors’ work in the context of past and 
current research, we illustrate that this collective set of stable evidence should convince policymakers 
that it is «<?/reasonable to assume that teachers and teaching is stable across time. Beyond this 
important opportunity to influence policy, we believe these findings underscore the need to build 
upon and expand the dependent measures we use to define and understand good teaching. After all, 
as we have noted (Lavigne & Good, 2014; in press) good teaching involves much more than 
increasing students’ scores on standardized achievement tests. 
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Los problemas de estabilidad del desempeno docente no son nuevos: Limitaciones 
y posibilidades. 

Resumen: Morgan, Hodge, Trepinski, y Anderson (2014) han escrito un artlculo que sigue 
confirmando lo que hemos sabido sobre la estabilidad limitada que los efectos docentes tienen en los 
logros de los estudiantes. En este comentario, queremos senala otras contribuciones potenciales de 
este trabajo que podrian informar la practica, polltica e investigation educativa. Mientras que 
senalamos que Morgan y colegas no prestaron suficiente atencion a la historia, tenemos la 
oportunidad de replantear sus hallazgos. Teniendo en cuenta el trabajo de los autores en el contexto 
de investigaciones pasadas y actuales, proponemos que este conjunto colectivo de pruebas estable 
debe convencer a los politicos de que no es razonable suponer que los docentes y la ensenanza es 
estable a traves del tiempo. Mas alia de esta importante oportunidad de influir en la polltica, creemos 
que estos resultados ponen de relieve la necesidad de aprovechar y ampliar las medidas dependientes 
que usamos para definir y entender la buena ensenanza. Despues de todo, como hemos senalado 
(Lavigne & Good 2014; en prensa) buena ensenanza implica mucho mas que el aumento de 
calificaciones de los estudiantes en pruebas estandarizadas 

Palabras clave: polltica de evaluation docentes; estabilidad del profesorado; efectos docentes; 
desempeno docente 

Problemas de estabilidade de desempenho dos professores nao sao novos: Limita§oes e 
possibilidades. 

Resumo: Morgan, Hodge, Trepinski, e Anderson (2014) escreveram um artigo que confirma o que 
ja sabemos sobre a estabilidade limitada dos efeitos que os professores tern sobre as avanyos dos 
estudantes. Nese comentario, observamos outras contribuiySes potenciais do presente trabalho, que 
poderia informar a pratica, polltica e pesquisa educacional. Enquanto notamos que Morgan e seus 
colegas nao prestaram atenyao suficiente para a historia, temos a oportunidade de repensar as suas 
conclusoes. Considerando-se o trabalho dos autores no contexto do passado e atual da pesquisa, 
propomos que este conjunto coletivo de provas estaveis deveria convencer os politicos de que nao e 
razoavel supor que os professores e o ensino e estavel ao longo do tempo. Alem deste importante 
oportunidade de influenciar a polltica, acreditamos que estes resultados destacam a necessidade de 
constmir e expandir as medidas que usamos para definir e compreender o que e o bom ensino. 
Afinal, como ja observamos (Lavigne & Good 2014; no prelo) bom ensino envolve muito mais do 
que o aumento na pontua^ao em testes padronizados 

Palavras-chave: avaliayao da polltica educacional; estabilidade dos professores; fins educativos; 
desempenho dos professores 


Introduction 

Morgan, Hodge, Trepinski, and Anderson (2014) have written a useful article providing yet 
more data to demonstrate both the low stability of teacher performance and teacher effectiveness 
over time. Their work assumes additional potential \alue in that it studied the effects of 132 teachers 
over five years using measures of teacher performance (observation ratings) and teacher 
effectiveness (standardized achievement tests), with assessments in multiple curriculum subjects. 
The authors found a weak relationship between an observational measure of teaching performance 
(in this case the TAP observation system, see Jerald & Van Hook, 2011) and standardized measures 
of student achievement. This complements others’ recent reports of low correlations between other 
observational measures and student achievement (Cohen, in press; Kane, Kerri, & Pianta, 2014) and 
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evidence that teacher actions are not very stable even from lesson to lesson (Patrick & 
Mantzicopoulos, 2014). This collective set of stable evidence should convince policymakers that the 
assumptions imbedded in Race to the Top about the easy use of observational and achievement data 
to evaluate teaching are faulty. Identifying effective teachers is more difficult than they believed. 

Having acknowledged the authors’ contribution to the literature, we express our 
disappointment with their inattention to history. We believe that their ahistorical framing of the 
issue is misleading and this presentation limits the value of their contribution. The authors start their 
abstract with the observation that, “The last five to ten years has seen a renewed interest in the 
stability of teacher behavior and effectiveness” (Morgan et al., 2014, p. 1). And they restate their 
position in the introductory part of the article. Yet, the reader is never told what previous research 
had found about the stability of teacher behavior and effectiveness. This is like starting in the middle 
of a book! A better and more logical starting point for the authors would have been the provision of 
a brief review of the research on teacher stability—what preceded this renewed interest in the 
stability of teacher effects? If the authors had reviewed earlier literature, they would have found that 
their major finding was known long before they conducted their research. And access to this 
information is not difficult to obtain. For example, Konstantopoulos (2014) noted that issues of 
teacher stability have been studied for decades and provided a brief and effective review of this 
literature. He noted that stability of teaching has been studied in different ways including the degree 
of stability of teacher effects: when a teacher teaches the same material to a different group of 
students (e.g., Rosenshine, 1970), across instructional periods in the same year (e.g. Emmer, 
Evertson, & Brophy, 1979; Rosenshine, 1970), and when a teacher teaches different classes of 
students over time (e.g., Brophy, 1979). Across these varied contexts, Konstantopoulos (2014) 
concluded that the stability of teacher effects was low. Other earlier researchers also had reported 
that teacher effects on students’ achievement were not highly stable. For example, Berliner (1976) 
reported, “Our own research, just completed, involved about 200 elementary school teachers, each 
of which taught a 2-week, specially designed teaching unit in reading and mathematics. Residual gain 
scores for each subject matter were calculated. These measures of effectiveness using different 
content and the same students were correlated. From these data we found that measures of 
effectiveness in the two curriculum areas correlated about .30” (p. 379). 

Brophy (1973) studied 165 elementary school teachers’ stability of residual gain estimates of 
their impact on student achievement over 3 years and found that roughly 14% of teachers had high 
effects on students over 3 years 14% had low effects for three consecutive years. Further, some 
teachers showed linear increases or decreases over time and 49% of teachers’ residual gain scores 
were inconsistent over time. Elsewhere, Good and Grouws (1977) studied over 100 third and fourth 
grade teachers’ stability over time on the Iowa test of Basic Skills. They found even lower levels of 
stability than had Brophy when they considered teacher residual gains over time across all math 
subtests. 

We acknowledge that Morgan et al. (2014) provided an important replication with a rich data 
set. However, we are uncertain whether or not they are aware of their replication. We have no basis 
for concluding why the authors did not mention the easily available historical research on teacher 
stability. Perhaps they were not aware of previous research or thought it unimportant? Perhaps they 
felt that they had insufficient space to acknowledge it? Or perhaps they thought that framing the 
problem without any history made their argument “fresher” or more unique? Or they may have felt 
that their replication would be viewed as less important than a “new” contribution. This is possible 
as, historically, replications have been perceived as less valuable (for a review of bias towards 
replications, see Makel & Pluker, 2014). Yet, we and others contend otherwise. The argument for 
replication has been made exceedingly well by Makel and Plucker (2014). They write, “If education 
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research is to be relied upon to develop sound policy and practice, then conducting replications on 
important findings is essential to moving toward a more reliable and trustworthy understanding of 
educational environments” (p. 313). Makel and Plucker (2014) have strongly recommended that 
factual evidence should be valued over putative novelty. We agree, and this is especially the case 
when new and replicating factual information can be linked to important policy decisions/actions. If 
we consider the increasing changes in the curriculum, diversity of student populations, changes in 
the teaching force, and arguably better statistical measures of teachers’ impact on student learning 
(i.e., value-added estimates rather than residual gain scores), it would seem important to determine if 
earlier research describing the low stability of teacher effects was still accurate. In other words, given 
these changes, is teacher performance and effectiveness now more stable than in another era? This is 
a timely and important question as states are using teacher evaluation data to make hire, fire, and 
grant tenure to teachers. 

At the same time, had the authors acknowledged that the field has known for decades that 
teaching effects on student achievement have limited stability, they would have framed their research 
focus more accurately and appropriately. And, had they examined stability in other professions (e.g., 
coaching), they would have found that expectations for stable performance over time cannot easily 
be assumed. For example, an examination of the top football teams illustrates that fluctuation in 
performance is considerable even week to week. Table 1 illustrates how much change occurred in 
the top ranked football teams over time and in one week. 
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Table 1 


A Comparison of the 25 Top Football Teams as Ranked by Associated Press at Preseason, Week 12, 
and Week 13, 2014: How Stable are Top Teams Over Time? _ 


Team 


Ranking* 


Preseason 

Week 12 

Week 13 

Florida State 

1 

2 

1 

Alabama 

2 

4 

2 

Oregon 

3 

3 

3 

Oklahoma 

4 

— 

23 

Ohio State 

5 

8 

7 

Auburn 

6 

9 

16 

UCLA 

7 

14 

11 

Michigan State 

8 

12 

10 

South Carolina 

9 

— 

— 

Baylor 

10 

6 

6 

Stanford 

11 

— 

— 

Georgia 

12 

16 

9 

LSU 

13 

20 

— 

Wisconsin 

14 

22 

14 

use 

15 

— 

24 

Clemson 

16 

18 

— 

Notre Dame 

17 

15 

— 

Mississippi 

18 

10 

8 

Arizona State 

19 

7 

13 

Kansas State 

20 

13 

12 

Texas A&M 

21 

— 

— 

Nebraska 

22 

11 

21 

North Carolina 

23 

— 

— 

Missouri 

24 

— 

19 

Washington 

25 

— 

— 

Mississippi State 

— 

1 

4 

TCU 

— 

5 

5 

Arizona 

— 

17 

15 

Duke 

— 

19 

25 

Marshall 

— 

21 

18 

Colorado State 

— 

23 

22 

Georgia Tech 

— 

24 

17 

Utah 

— 

25 

20 


Note. * The symbol, “—” signifies not ranked. 


Clearly, these data indicate that stability of football performance shows noticeable change 
from the preseason to week 12. Notably 8 teams ranked in the preseason top 25 were no longer 
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ranked at week 12, and 8 teams ranked in week 12 were not ranked in the preseason. Some of these 
fluctuations were dramatic as teams ranked 4, 9, 11, and 15 in the preseason were not ranked at 
week 12. Further, in week 12 the teams ranked 1 st and 5 th were not ranked in the preseason. Even in 
one week, week 12 to 13, there is variation both within the week and over time. For example, 
Oklahoma who was ranked 4 in the preseason fell out of the ratings in the 12 th week, and returned 
as the 23 rd team in week 13. And as we submit our paper on November 21 we suspect other 
evidence of unstable performance will appear in week 14 and beyond. 

Applying these data to high stakes testing of teachers we suggest that one could consider the 
preseason ranking as a “valued added” score. The expert raters take into consideration the quality of 
returning and new player talent/achievement potential, the schedule, and other factors and then 
rank the teams in terms of expected performance ranks. Should coaches at schools like South 
Carolina, Stanford, Texas and Washington be fired because their talented teams did not live up to 
expectations? Should coaches at schools like TCU, Arizona, Georgia Tech, Utah, Colorado State and 
Duke be awarded bonuses and tenure because they achieved more with teams who were judged to 
have less talent? 

Konstantopoulos (2014) has noted that stability of performance varies markedly both in 
athletics and business. Sobolevska (2014) has examined the performance of the top 200 professional 
golfers in America and their stability over six consecutive years. She concluded that the stability of 
golfers is very low and many top golfers do not appear on the list from one year to the next. 
Generally speaking, it would seem that the conditions that surround professional golfing are quite 
standard in comparison to the conditions that teachers face. Professional golfers use comparable 
equipment, and the rules for determining successful performance are vastly clearer than for teachers. 
Although conditions change from course to course on a given day, top golfers playing in the same 
tournament face the same course and the same weather conditions. Teachers on the other hand, face 
conditions that are inherently complex and unequal. Some teachers are teaching primarily talented 
students or teaching students who are from low-income homes and who often have not received 
appropriate preschool opportunities and or educational challenges. Some students come to school 
rested and well prepared for instruction, whereas other students arrive at school tired, hungry, and 
unprepared. Morgan et al.’s framing of the research question implicitly suggests that it is reasonable 
to assume that teacher performance and effects are stable. We find this assumption misleading if not 
erroneous. 

Failure to Report History: A Lost Opportunity to Influence Policy 

Policymakers hold extraordinary and often exaggerated beliefs about what teachers can and 
cannot accomplish. We (Lavigne & Good, in press) and others (Berliner, 2014; Biddle, 2014) have 
noted, teachers certainly influence achievement but that the effects of poverty on student 
achievement are huge as do data from the Program for International Student Assessment (PISA) 
show. Data from the PISA assessment illustrates that countries that distribute resources to schools 
equally are those that have higher student achievement. Countries like the US that distribute 
resources unequally across schools have lower student achievement. The PISA data show that low 
SES accounts for about 20% of the variance in student achievement scores. Although the authors 
cannot provide an exhaustive historical context for everything they discuss, we do think it would 
have been reasonable for them to have briefly explored the question of how much can teachers be 
expected to overcome in a given year and to have questioned the extent to which teachers can be 
expected to have stable effects given the fluctuations and their own lives (Spencer, 1986) and in the 
lives of students they teach. 
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Yet, generally, the research community, like these authors, was silent and did not question 
initially the easy assumptions when policymakers and Race to the Top (RttT) advocates advanced 
simplistic strategies for conducting high stake teacher evaluations by using classroom observation 
and student achievement data. Previous research provided many caveats about the difficulty of 
linking classroom process and student achievement (Brophy & Good, 1986; Everston & Green 
1986; Good & Brophy, 2008). Although talking “truth to power” may not have slowed the powerful 
forces (including RttT and substantial funding from the Bill and Melinda Gates Foundation) 
clamoring for evaluating teachers on performance and outcome measures, but, it would have been 
worth the effort. What if educational researchers early in the formation of RttT had pointed out that 
previous research provided clear knowledge that teacher performance and effectiveness was not 
generally highly stable over consecutive years? Could this stance have led to reasoned debate on 
issues such as - are today's observational and statistical techniques sufficiently better (contain less 
measurement error) and more capable of linking teaching and learning now than they were 
previously? 

The history of educational reform has been one of consistent failure as the field moves from 
fad to fad (Cuban, 2013; Good & Braden, 2000; McCaslin, 1996; Payne, 2010; Ravitch, 2010; Tyack 
& Cuban, 1977). Over time policy makers have repeatedly implemented simple solutions for reform 
based upon little if any research. These reforms are costly and waste enormous amounts of time and 
money. These solutions are quickly abandoned because they do not provide immediate answers to 
complex problems. But, after these failures, yet again another simple but costly reform appears. We 
have reviewed these issues and the interested reader can obtain our detailed arguments elsewhere 
(Lavigne & Good, 2014, in press). But briefly, Lavigne and Good (2014) characterized the failure as 
a series of steps. Each crisis is an acute concern about students’ poor achievement on standardized 
tests. Each crisis is based upon the performance of American students in comparison to their 
international peers. The new general solution is based on the rejection of the status quo and a call 
for something new and notably different from current practice. In time data suggest that the new 
reform has not solved the problem and new reforms are eventually sought. Typically the 
professional research community has not actively spoken against problematic reforms in a timely 
fashion. Ultimately, again American policymakers embark on costly new endeavors even without 
carefully defining what the new movement involves. In reviewing these characteristics of failed 
reform, we believe that had Morgan et al. (2014) missed an important opportunity to make a modest 
attempt at challenging these chronic patterns of failed reform. We contend that it is more reasonable 
for research to precede rather than to follow reform efforts. Further, we also believe that when 
research is available it should be recognized and used. 

Measuring Good Teaching: Another History Lesson 

Much of the current reform rhetoric is based on the assumption that we can measure good 
teaching. What is good teaching and can it be captured with any observational instmment? Within 
the context of this article, it seems the authors assume that appropriate teaching involves increasing 
student performance on standardize achievement tests. And, the authors persuasively note the 
importance of validating observational instruments (and even discuss how to do it), yet, they have 
little to say about the validity of the TAP program/instmment that they used in their study. The 
authors only address this issue (on page 6) when they quote Jerald and Van Hook (2011, p. 4), that 
the instrument’s ‘indicators provide sufficient breadth to ensure that evaluation ratings reflect the 
kind of effective instructional practices that predict positive learning outcomes. Unfortunately, the 
authors do not provide a reference for Jerald and Van Hook (2011); however, an examination of 
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that source does not provide a clear demonstration that the instrument is predictive of value-added 
achievement. 

Readers are still left wondering, how does the TAP measure good teaching and how/if/why 
the TAP can be considered an appropriate measure of good teaching? These appear to be important 
issues to address, particularly because Morgan et al. (prior to describing the observational procedures 
they used in the study) note, “.. .there is remarkably little research to guide such critical decisions as 
which teachers to hire, retain, remunerate, and promote” (Rice, 2003). 

It is useful to note that a conclusion remarkably similar to Rice’s was expressed fifty years 
earlier by a committee appointed by the American Educational Research Association. 

The simple fact of the matter is that, after forty years of research on teacher 
effectiveness during which a vast number of studies have been carried out, one can 
point to few outcomes that a superintendent of schools can safely employ in hiring a 
teacher or granting him tenure, that an agency can employ in certifying teachers or 
that a teacher-education faculty can employ in planning or improving teacher- 
education programs. (AERA, 1953, p. 657). 

So we and readers are left with a least two questions. First, did the field know anything more about 
good teaching when Rice wrote in 2003 than was known in 1953? Similarly, do we know any more 
now in 2014 than we did in 1953? Simply put, if we are to design an observational system believed to 
be predictive of student achievement, we need to have some knowledge based on theory and 
research that links teaching to higher or lower levels of student achievement and some reason to 
believe that the observational system we use includes those key teacher actions. Given that the 
authors used Rice’s conclusions that the field has limited capacity for making critical decisions such 
as teacher retention, it seems important, to ask why it is plausible to assume now that observational 
systems generally and the TAP specifically can be used for high-stakes decisions. Clearly, the authors 
were willing to at least implicitly accept the belief that TAP successfully captures the teacher actions 
that lead to higher value added scores. 

Other Opportunities to Inform Practice, Policy, and Research 

We now turn to consider the value of Morgan et al.’s work (2014) in the context of the brief 
history that we have provided. We comment on additional contributions this work can make today 
to inform policy, practice, and research as they further explore their data. 

Study Participants 

It is not clear why researchers did not include teachers who changed grades or schools. 
Obviously, they could not be included in the overall quantitative analyses, but an examination of 
teachers who moved would seemingly provide useful descriptive information. For example, were 
teachers who changed grades/schools more or less stable than those who stayed in their same 
context? Should we “anticipate” or account for instability (of teacher actions or teacher effects on 
students) as a natural function of adjusting to a new setting? Did teachers who moved from an 
elementary school setting to a middle school setting (or vice versa) have higher or lower effects? 

That is, there is considerable evidence to suggest that teachers in elementary schools are rated higher 
on observational measures than are teachers in middle schools (Mihaly & McCaffrey, 2014). Were 
these findings also obtained in this research? Further, it is not clear whether subject matter was an 
important mediator of teacher ratings (were teachers rated higher in math than in reading)? 
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Measures of Effective Teaching 

The above points raise an additional question: How were 5 year of data over multiple years 
and subjects combined? It is not clear how the authors combined their data to show the effects of 
teachers on students. We are told that each year grade 3-8 students were administered the Palmetto 
Assessment of State Standards. We also know that students in grades 4-7 were administered science 
and social studies tests. So when the authors discuss the performance of a teacher for whom we 
have multiple subjects, is the performance an aggregation of multiple tests over years? We are not 
suggesting that the data analysis is inappropriate; however we do suggest that in our reading of the 
article it was not clear to us how the data analysis was conducted for teachers in grades 4-8. 

Observers 

More information about coders and their training and deployment would be helpful in 
understanding the research methods. For example, the authors described the observations as 
“expert” observations. Given the current focus on high stakes testing, expert observers often 
suggest the use of highly trained external observers who have passed rigorous training. However, we 
are informed that the observers were school administrators, mentor teachers, and master teachers. 

It is not clear what criteria were used for defining master teachers or who made those designations. 
How were they trained on the observation instrument and were they familiar with using the TAP 
prior to this research? How was coder drift accounted for (seemingly in an article dealing with 
stability of teacher performance, we might expect some discussion of coder stability)? Were teachers 
observed by different observers? And, is it possible that who is doing the observation may be more 
important to reliability than how many observations were conducted? The authors do address this 
issue, but only briefly and we think more information would help readers to understand better their 
observational procedures. Further, it is not clear why observational results were rounded. There may 
be good reason for doing this, but without more information it seems that this decision would limit 
variation and potentially restrict the correlation between observation data and value-added data. 

The Importance of Context 

The authors provide commentary from other researchers that likely mediate and explain the 
lack of stability of performance and effectiveness and the low relationship between teacher actions 
and achievement. They also add to this discussion by noting, “Finally, there are contextual 
differences, such as grade level, subject matter, and classroom size and composition” (p. 14). We 
agree with the authors that there is substantial evidence that these are important considerations. 

And, for an especially thoughtful discussion of these context issues, see Berliner (2014). However, 
since the authors have grade level and subject matter data, why not discuss these possibilities in their 
own data set? 

The authors also place considerable emphasis upon the fact that observational ratings for 
teachers became higher over time, but it is not clear that this “instability” was not actually a function 
of teachers. For example, there is some evidence that teachers increase in their performance during 
the first few years of teaching, but this evidence is not reviewed in the paper nor is any consideration 
given to the fact that related research suggests that new teachers become more effective at least for 
the first few years. Furthermore, the authors suggest that in general observational ratings were 
higher than value added performance. However, they note a context finding suggesting that in 
average or above average performing schools, observation ratings and value added ratings were 
similar. But that in poor performing schools, teachers had higher mean observational ratings than 
value added ratings. The authors conclude, “This suggests that on average the observational ratings 
may over estimate teacher effectiveness in lower performing schools.” (p. 12). However, this 



Education Policy Analysis Archives Vol. 23 No. 2 


10 


conclusion appears to be based on a general average and it is not clear whether this was true equally 
across all subjects. Further, there are alternate explanations for this finding which are not explored 
including the possibility that teachers in low performing schools were actually scoring higher on the 
TAP because they were actually performing the behaviors better over time that this instmment 
measures. 


Concluding Remarks 

We end our remarks with a question the authors eventually raise. Is it reasonable to assume 
that teachers and teaching should be stable? This is a central issue and one that policy makers have 
spent little time addressing. Today’s simple policy orientation is that we can identify good teachers 
and reward them and identify poor teachers and remediate them or terminate them. Unfortunately, 
policymakers have not considered these assumptions carefully including the lack of teacher stability 
over consecutive years (as discussed here). When we apply this knowledge to simulations of high- 
stakes decision-making, a significant number of teachers are misclassified (Guarino, Reckase, & 
Wooldridge, 2012; Schochet & Chiang, 2010). Effective teachers are fired and ineffective ones and 
rewarded. The costs of these misclassifications to teachers, schools, students are insurmountable. If 
researchers want to understand those teacher actions that relate to student achievement, they need 
to be very sure that they are studying teachers who have stable performance and effects. However, 
from research presented by Brophy in 1973 to research presented by Berliner in 2014, we know that 
such teachers are not common (recall the Brophy, 1973, reported that 14% of highly effective 
teacher and 14% of low effective teachers held their rating over 3 consecutive years). 

Berliner (2014) aptly summarizes the issue of teacher stability, 

Although hard to ferret out in their “pure” form as an independent main effect, 
teacher effects on student achievement exist, and they are likely to be strong enough 
for us all to worry about who teaches our children and what their training has been. 

There does seem to be a small percentage of teachers who show consistency no 
matter what classroom and school compositions they deal with. Those few teachers 
who have strong and consistent positive effects on student outcomes, we should 
learn from and reward. And, those few teachers who have strong negative effects on 
student outcomes need to be helped or removed from classrooms. But the 
fundamental message from the research is that the percentage of such year-to-year, 
class-to-class, and school-to-school effective and ineffective teachers appears to be 
much smaller than is thought to be the case. (p. 27) 

We agree. Perhaps one direction for future research is an examination of the patterns that Berliner 
(2014) addresses, with a focus on when and why stability should be expected. Given that the 
majority of teachers do not fall into stable patterns, as traditionally defined, the field might benefit 
from further considering our expectations for reasonable stability of professional practice. Further, 
future research should build upon and expand the dependent measures we use to define/understand 
good teaching. After all, good teaching involves much more than increasing students’ scores on 
standardized achievement tests. Good teaching includes helping students to become better problem 
finders and problem solvers, as well as encouraging student civility, social responsibility, and much 
more (Lavigne & Good, in press). It is prudent to recall that as recently as the late 60’s teachers were 
not considered to have much impact on students’ achievement and that students’ success in schools 
was primarily determined by student and family variables. If we look using good research 
procedures, we may well find evidence that some aspects of teaching and their consequences on 
students are more enduring than teacher effects on standardized achievement scores. 
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